Before we begin building models to predict the truth or falsity of statements, we need to do some housekeeping. We need to join together, examine, pre-process, and clean the sets of features we created in the earlier sections. That is what I do in this section.
Again, I will start by loading relevant packages.
# before knitting: message = FALSE, warning = FALSE
library(plyr) # has join_all function (loaded before the tidyverse so plyr doesn't mask dplyr verbs)
library(tidyverse) # cleaning and visualization
library(ggthemes) # visualization
library(caret) # modeling
library(AppliedPredictiveModeling)
library(e1071) # has skewness() function
library(DescTools) # has Winsorize() function
First, I will load the various data from the features we just extracted.
# load all the nice tidy df's of features we created (remember stats_words has multiple dtm's)
load("stats_clean.Rda") # has text of statement
load("stats_length.Rda") # statement lengths
load("stats_pos.Rda") # parts of speech
load("stats_sent.Rda") # sentiment
load("stats_complex.Rda") # complexity and readability
load("stats_words.Rda") # bag of words (mini document-term matrices)
To begin, let’s take all the feature sets we created (statement length, parts of speech, sentiment, readability, and word frequency) and put them together. We can join these different feature sets by linking individual statements via their statement identification number (the “stat_id” column), which we made sure to attach and keep constant throughout the feature extraction process. (This can be done easily using SQL-style join functions available through various R packages, particularly in the tidyverse. For background on SQL joins, which are a super useful and universal theme in data management, see Join (SQL), 2018.)
# join all features with ground truth, by stat_id
stats_all <-
join_all(dfs = list(stats_length,
stats_pos,
stats_sent,
stats_complex,
stats_dtm_100),
by = "stat_id",
type = "left") %>%
left_join(stats_clean %>%
select(stat_id,
grd_truth),
by = "stat_id") %>%
select(stat_id,
grd_truth,
everything())
# print joined df
stats_all
Before engaging in any predictive modeling, it’s important to make sure that there are no major underlying problems in the data. As Kuhn & Johnson (2013, p. 27) note, “data preparation can make or break a model’s predictive ability”. They outline a few key issues to check for when cleaning or pre-processing data. The ones that are particularly relevant to us are: heavily skewed predictor distributions, predictors with zero or near-zero variance, and collinearity between predictors.
I am going to check for each of these problems and adjust (i.e. delete or transform) data in cases where any major problems seem to arise. Later, when we actually build predictive models, we can compare the performance of models that use this cleaned data compared to models that use the raw data.
To get an overall sense of our variables, let’s take a look at their distributions. These are plotted below (one distribution for each of our 124 variables; different categories of features are delineated in different colors). When we do this, we see a fair deal of skew in many of our variables, and some variables do not seem to have much variance (i.e. the majority of the data points take the same value). The outlying points in the skewed distributions may exert a high degree of influence on some of our models’ estimates, and the variables with zero or near-zero variance simply won’t be very useful for prediction. Let’s go about adjusting our data a bit to account for this.
stats_all %>%
select(-stat_id) %>%
keep(is.numeric) %>%
gather(key = "variable",
value = "value") %>%
mutate(feature_type = case_when(grepl("^n_", variable) ~ "length",
grepl("^pos_", variable) ~ "part_of_speech",
grepl("^sent_", variable) ~ "sentiment",
grepl("^read_", variable) ~ "complexity",
grepl("^wrd_", variable) ~ "word")) %>%
ggplot(aes(x = value,
fill = feature_type)) +
facet_wrap(~ variable,
ncol = 7,
scales = "free") +
geom_histogram() +
scale_fill_discrete(breaks=c("length", "part_of_speech", "complexity",
"sentiment", "word")) +
theme(legend.position="top")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Kuhn & Johnson (2013) suggest a fair number of popular ways of dealing with skewed data. One common solution is to apply different types of transformations to the data, depending on the direction of the skew. One very popular method (that also has the advantage of minimizing thinking) is to apply the Box & Cox (1964) transformations, an entire family of transformations that adjust various skewed distributions in pre-defined ways according to their skew. However, I don’t love these transformations because they reduce the interpretability of coefficients (e.g. some variables might be squared, others raised to the power -1). Further, many of our distributions are simply Poisson-looking count distributions, with most values falling in a few different count bins – various transformations aren’t really going to have much of an effect on the distribution of points between these bins, just the spacing between them. What I am more worried about are those distributions where a few extreme outliers significantly skew the distribution (e.g. the distribution of “n_words”, which counts the number of words in each statement). A nicer and simpler type of transformation exists that tames these outliers without changing the scale, thus leaving our variables in interpretable units. It is simply to select cut points at the ends of the distribution (e.g. at 5% and 95%) and make those the lowest and highest possible values (i.e. any value below the 5th percentile is replaced with the 5th percentile value, and any value above the 95th percentile is replaced with the 95th percentile value). This method is called Winsorizing (Winsorizing, 2019).
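For reference, the Box-Cox family mentioned above transforms a (positive) variable $x$ according to a parameter $\lambda$ estimated from the data:

$$
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\log(x), & \lambda = 0
\end{cases}
$$

Depending on the estimated $\lambda$, this can amount to squaring, taking a square root, inverting, or log-transforming a variable – which is exactly the interpretability cost noted above.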
Let me demonstrate what this looks like for one variable that appears highly skewed (the “n_words” count referenced earlier). Visualized below are the raw distribution and the winsorized distribution (winsorized at 1% and 99%). As we can see, this massively reduces the skew in the distribution.
# make data frame to store columns for raw and winsorized "n_words" column
winsor_exmpl <- as.data.frame(stats_all$n_words)
winsor_exmpl <- winsor_exmpl %>% dplyr::rename(n_words_raw = 1)
# winsorize the variables (at 1% and 99%)
winsor_exmpl$n_words_winsorized <-
Winsorize(stats_all$n_words,
probs = c(0.01, 0.99))
# visually compare raw and winsorized distributions
winsor_exmpl %>%
select(n_words_raw,
n_words_winsorized) %>%
gather(key = "transform",
value = "value") %>%
ggplot(aes(x = value)) +
facet_wrap(~ transform,
scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
To apply this transformation across all variables, we need some method for deciding which variables are skewed enough to warrant winsorization. We could decide this based on visual inspection of the individual histograms above and personal judgment. However, there are a lot of them, and there exist metrics for quantifying skew – kurtosis, skewness, and other rules of thumb (e.g. the ratio between the max and min value of a variable should be no more than 20) (Kuhn & Johnson, 2013, Chapter 3). I am going to simply use the metric called “skewness” (there’s no obvious reason why one metric is better than another). The formula for this metric is shown below, and some rules of thumb suggest that we should worry about distributions with skewness values greater than 2. So, we’ll winsorize all variables whose skewness has an absolute value greater than 2.
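In its basic moment-based form, the sample skewness of a variable $x$ with mean $\bar{x}$ is:

$$
\text{skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{3/2}}
$$

(The `skewness()` function in e1071 also offers small-sample-adjusted variants of this formula via its `type` argument; the differences are minor for our purposes.)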
# find skew of each variable
allvar_skewness <-
as.data.frame(
lapply(subset(stats_all, select = -c(stat_id, grd_truth)),
e1071::skewness)) %>%
gather(key = "feature",
value = "skew") %>%
arrange(desc(skew))
# get names of all variables that have absolute skew value of more than 2
allvar_skewness %>%
filter(abs(skew) > 2)
# winsorize (at 1% and 99%) only those variables with an absolute skew value more than 2
skewed_vars <- allvar_skewness %>% filter(abs(skew) > 2) %>% pull(feature)
stats_allclean <-
  stats_all %>%
  mutate_at(vars(one_of(skewed_vars)),
            ~ Winsorize(., probs = c(0.01, 0.99)))
Another type of issue in our data we might want to deal with is variables that exhibit zero or near-zero variance (with almost all data points clustered at one value). The caret package has a built-in function, nearZeroVar, which helps us automatically identify such variables. (It does this by flagging variables where either all values are the same, or the ratio of the count of the most frequent value to the count of the second most frequent value is beyond some specified threshold.)
nearZeroVar(stats_all,
saveMetrics = TRUE) %>%
arrange(freqRatio)
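If any predictors come back flagged (the nzv column of the output above), they could be dropped before modeling. A minimal sketch of how this might look, where nzv_metrics, nzv_flagged, and stats_no_nzv are hypothetical names:

```r
# flag near-zero-variance predictors and drop them from the feature set (sketch)
nzv_metrics <- nearZeroVar(stats_all, saveMetrics = TRUE)
nzv_flagged <- rownames(nzv_metrics)[nzv_metrics$nzv]
stats_no_nzv <- stats_all[, !(names(stats_all) %in% nzv_flagged), drop = FALSE]
```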
A problem that might affect models like logistic regression is collinearity: a high degree of correlation between our variables may lead to unstable and unreliable estimates. Kuhn & Johnson (2013) suggest an algorithm for diagnosing and eliminating problems of collinearity, which is implemented in caret’s findCorrelation function.
# select only predictor variables (i.e. get rid of stat_id and grd_truth columns)
stats_allclean_x <- subset(stats_allclean,
select = -c(stat_id, grd_truth))
# find high correlations
high_corr_indices <- findCorrelation(x = cor(stats_allclean_x),
cutoff = 0.75)
# find names of those columns with high correlations
high_corr_names <- names(stats_allclean_x)[high_corr_indices]
# eliminate high correlation columns from cleaned dataset
stats_allclean <-
  stats_allclean %>%
  select(-one_of(high_corr_names))
Finally, I will transform the data to ease interpretation and allow for better comparison in later analyses. Specifically, I am going to center and rescale all predictor variables (i.e. the features we previously extracted). First, I will center each variable around its mean – i.e. compute the variable’s mean and then subtract that value from each row. In this way, we will know that any value approaching zero is near the mean for that variable. Second, I will rescale each variable by its standard deviation – i.e. divide the values in each column by the standard deviation of that column, so that deviations between variables are comparable. Thus, for any variable, a value like 0.5 will mean the same thing – that value is 0.5 standard deviations above the mean for that feature. Luckily, the preProcess function in the caret package (Kuhn, 2008) makes this extremely easy.
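In symbols, each raw value $x_i$ of a predictor is replaced by its standardized (z-score) counterpart:

$$
z_i = \frac{x_i - \bar{x}}{s}
$$

where $\bar{x}$ and $s$ are the mean and standard deviation of that predictor.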
# center and scale with preProcess and predict from caret
pre_proc <-
  preProcess(x = subset(stats_allclean,
                        select = -c(stat_id, grd_truth)),
             method = c("center", "scale"))
stats_allclean <- predict(pre_proc,
                          newdata = stats_allclean)
# print centered and scaled data frame
stats_allclean
Now, let’s save the various objects created.
# save the raw and various cleaned data objects
save(stats_all,
     stats_allclean,
     file = "stats_all.Rda")
Box, G. E., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological), 26(2), 211-243.
Join (SQL). (2018). In Wikipedia. Retrieved from https://en.wikipedia.org/w/index.php?title=Join_(SQL)&oldid=868927067
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1-26.
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling (Vol. 26). Springer.
Winsorizing. (2019). In Wikipedia. Retrieved from https://en.wikipedia.org/w/index.php?title=Winsorizing&oldid=891819633